## [1] 4898 13
There are 4898 observations with 13 variables in the dataset which are :
## [1] "X" "fixed.acidity" "volatile.acidity"
## [4] "citric.acid" "residual.sugar" "chlorides"
## [7] "free.sulfur.dioxide" "total.sulfur.dioxide" "density"
## [10] "pH" "sulphates" "alcohol"
## [13] "quality"
## 'data.frame': 4898 obs. of 13 variables:
## $ X : int 1 2 3 4 5 6 7 8 9 10 ...
## $ fixed.acidity : num 7 6.3 8.1 7.2 7.2 8.1 6.2 7 6.3 8.1 ...
## $ volatile.acidity : num 0.27 0.3 0.28 0.23 0.23 0.28 0.32 0.27 0.3 0.22 ...
## $ citric.acid : num 0.36 0.34 0.4 0.32 0.32 0.4 0.16 0.36 0.34 0.43 ...
## $ residual.sugar : num 20.7 1.6 6.9 8.5 8.5 6.9 7 20.7 1.6 1.5 ...
## $ chlorides : num 0.045 0.049 0.05 0.058 0.058 0.05 0.045 0.045 0.049 0.044 ...
## $ free.sulfur.dioxide : num 45 14 30 47 47 30 30 45 14 28 ...
## $ total.sulfur.dioxide: num 170 132 97 186 186 97 136 170 132 129 ...
## $ density : num 1.001 0.994 0.995 0.996 0.996 ...
## $ pH : num 3 3.3 3.26 3.19 3.19 3.26 3.18 3 3.3 3.22 ...
## $ sulphates : num 0.45 0.49 0.44 0.4 0.4 0.44 0.47 0.45 0.49 0.45 ...
## $ alcohol : num 8.8 9.5 10.1 9.9 9.9 10.1 9.6 8.8 9.5 11 ...
## $ quality : int 6 6 6 6 6 6 6 6 6 6 ...
## X fixed.acidity volatile.acidity citric.acid
## Min. : 1 Min. : 3.800 Min. :0.0800 Min. :0.0000
## 1st Qu.:1225 1st Qu.: 6.300 1st Qu.:0.2100 1st Qu.:0.2700
## Median :2450 Median : 6.800 Median :0.2600 Median :0.3200
## Mean :2450 Mean : 6.855 Mean :0.2782 Mean :0.3342
## 3rd Qu.:3674 3rd Qu.: 7.300 3rd Qu.:0.3200 3rd Qu.:0.3900
## Max. :4898 Max. :14.200 Max. :1.1000 Max. :1.6600
## residual.sugar chlorides free.sulfur.dioxide
## Min. : 0.600 Min. :0.00900 Min. : 2.00
## 1st Qu.: 1.700 1st Qu.:0.03600 1st Qu.: 23.00
## Median : 5.200 Median :0.04300 Median : 34.00
## Mean : 6.391 Mean :0.04577 Mean : 35.31
## 3rd Qu.: 9.900 3rd Qu.:0.05000 3rd Qu.: 46.00
## Max. :65.800 Max. :0.34600 Max. :289.00
## total.sulfur.dioxide density pH sulphates
## Min. : 9.0 Min. :0.9871 Min. :2.720 Min. :0.2200
## 1st Qu.:108.0 1st Qu.:0.9917 1st Qu.:3.090 1st Qu.:0.4100
## Median :134.0 Median :0.9937 Median :3.180 Median :0.4700
## Mean :138.4 Mean :0.9940 Mean :3.188 Mean :0.4898
## 3rd Qu.:167.0 3rd Qu.:0.9961 3rd Qu.:3.280 3rd Qu.:0.5500
## Max. :440.0 Max. :1.0390 Max. :3.820 Max. :1.0800
## alcohol quality
## Min. : 8.00 Min. :3.000
## 1st Qu.: 9.50 1st Qu.:5.000
## Median :10.40 Median :6.000
## Mean :10.51 Mean :5.878
## 3rd Qu.:11.40 3rd Qu.:6.000
## Max. :14.20 Max. :9.000
The target variable is the quality which depends on the other variables. Although the scale of quality shall be between 1 and 10, the quality metric varies between 3 and 9. Median is 6 and mean is 5.878. Most of the observations probably have quality of 6 based on this distribution.
Quality variable is a discrete variable which shall be cast as an ordered factor.
##
## 3 4 5 6 7 8 9
## 20 163 1457 2198 880 175 5
As I stated, highest count of observations have quality of 6. Most of the wines are graded 5, 6, and 7. Quality is an ordered factor variable so that I created a new variable using quality which is quality.level having values “Bad”, “Average” and “Good”. The quality values of this variable is (0,5], (5, 6], (6,9] respectively.
Let’s look deeper into the variables.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.800 6.300 6.800 6.855 7.300 14.200
As seen Fixed Acidity values are mostly fall between 6 and 7.3. Observations with fixed.acidity values less than 5 and more than 10 can be accepted as outliers.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0800 0.2100 0.2600 0.2782 0.3200 1.1000
Volatile acidity is positively skewed. There are outliers in observations. As stated in dataset description, both fixed and volatile acidity are related with acidity of the wine. So that there might be a relation between those two variables which will be analyzed further later in this report.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.2700 0.3200 0.3342 0.3900 1.6600
Citric acid values of wines are mostly fall between 0.49 and 0.74. Values are long tailed so that there are outliers in the dataset.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.600 1.700 5.200 6.391 9.900 65.800
Residual sugar is highly skewed. There are extreme outliers as 65.8. So that in the plot of residual sugar above the x axis is log transformed to see the dsitribution much easier.
PH values of observations are seem to have a normal distirbution.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.00 9.50 10.40 10.51 11.40 14.20
Alcohol values are skewed positively. Most of the values fall between 7 and 14.
The dataset consists of 4898 observations which includes 12 variables and the quality is the target variable (total 13) which meant to be vary between 0 and 10. The variable X is the observation number (identifier) and the rest of the variables are numerical.
The main feature of interest is quality, obviously. However, alcohol, density, acidity variables and sugar are also features of interest.
The acid-related features such as fixed acidity, citric acid, pH are of interest since thay might be related with the quality. Moreover, residual suger and the alcohol level may also be related with the level of quality.
Yes. Although quality values are numeric, those values represents an oredered factor. So I have created a categorical varaible “quality.level” which represent the quality level of wine which I have described in previous section.
Resiual sugar has a log tail distribution and the histogram was skewed. Therefore in order to investigate this variable I applied log transformation to residual sugar values.
In order to see the correlations between variables, a correlation matrix is visualized.
First of all, I have investigate the relation between quality variable and acidity variables. As seen in the correlation matrix, volatile.acidity seem to be negatively correlated with quality.
##
## Pearson's product-moment correlation
##
## data: quality and volatile.acidity
## t = -12.286, df = 4848, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.2009321 -0.1463411
## sample estimates:
## cor
## -0.1737701
Excluding the outliers, wine quality may be negatively correalted with volatile.acidity, however we cannot say this is exact since the correlation coefficient is very low. The description of the dataset indicates that the more the volatile acidity, the taste of the wine becomes more like a vinegar. Other acidity variables do not seem to be correlated with quality.
Let’s see how alcohol level is related with quality. As seen the correlation matrix, alcohol is the highest correlated variable with quality.
##
## Pearson's product-moment correlation
##
## data: quality and alcohol
## t = 33.279, df = 4855, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.4077970 0.4536048
## sample estimates:
## cor
## 0.4309785
When outliers removed from dataset, alcholol seems to be positively correlated with quality (again Pearson Coeficcient very low to indicate a strong correlation).
Now I want to investigate the relationship between quality and density. The correlation matrix plot shows that those two variables are corelated.
##
## Pearson's product-moment correlation
##
## data: quality and density
## t = -23.578, df = 4847, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.3457932 -0.2952862
## sample estimates:
## cor
## -0.3207677
Density and the quality of wine seems to be negatively correlated.
Now let’s see how the variables other than quality correlated with each other. In order to clearly see the relations between variables, the plot below is created.
The most corelated variables are “denisty” and “residual.sugar”. The correlation co-efficient for these two variable is 0.839. “total.sulfur.dioxide” and “free.sulfur.dioxide” are also positively correlated (0.616). All the variables related with acidicty are negatively correlated with pH which is a well-known fact in chemistry.
Feature of interest is selected as quality. The quality variable has highest poistive correlation with alcohol (0.436) and highest negative correlation with density(-0.307).
Yes. When all variables other than quality is investigated, density and residual sugar have strong positive correlation with the correlation coefficient of 0.839. The sulfur dioxide variables are also positively correlated with each other. pH and acidicity variables have negative correlation.
Residual sugar and density.
In the bivariete analysis, it is observed that the highest correlated variable with quality was alcohol. Now, in the multivariate analysis section, I want to investigate the effect of multiple variables on quality.
For the same density, when the residual suger increases, the alcohol content is also increases. Since we have shown that alcohol and quality are positively correlated, residual sugar and the density have also effect on the quality of wine.
I assume that every white wine contains sulfur dioxide. There are different types of sulfur dioxides in wine. In the dataset, free sulfur dioxidies and the total sulfur dioxidies are present. If we want to investigate the impact of free sulfur dioxides on quality of wine among other dioxides, we can analyse three variables “total.sulfur.dioxide”, “free.sulfur.dioxide” and “quality” variables together. As seen in the plot above, for a certain amount of sulfur dioxides, high quality wines contains more free sulfur dioxides with respect to bad ones.
I observed the relationship between the residual sugar, density and quality. For the same density, “Good” wines have more residual sugar and “Bad” ones have less. Probably, density and the residual sugar affects the level of alcohol which also affects the quality. Moreover the type of sulfur diodixes used in wine affects the quality. Free sulfur dioxides makes the quality of wine higher.
Sulfur dioxides was not seem to be related much with the quality of wine. However, in multivariate analysis, it is observed that the rate of free sulfur dioxidies is related with the quality of wine.
This plot shows that alcohol content percantage is higher in good wines with respect to average and bad wines. In the box plot, for each level of quality mean values of alcohol percantage are clearly seen.
This plot clearly shows that density and residual sugar have affect on alcohol level of wine. When the residual suger increases, the alcohol content is also increases. It is clearly seen that the good quality of wines have lower density for the same amount of residual sugar. Since we have shown that alcohol and quality are positively correlated, residual sugar and the density have also effect on the quality of wine.
When the amount of free sulfur dioxidies in total sulfur dioxidies increases, the wine quality also increases.
First of all, I investigate the distributon of the variables in the dataset. Based on the dataset description, the quality variable is the dependent variable in the dataset. When investigated, it varies between 3 and 9 based on the ratings given by people. Although those values are numerical, actually they are ordered factors. Therefore, I create categories as “Bad”, “Average” and “Good” and cut the quality values into those categories. This helped me a lot in investigating the data easier. Moreover, histograms showed that some of the variables are skewed and have outliers. In the further analysis, I take the subset of the original data by excluding the outliers.
In bivariate analysis section, I have explore the correlation between quality and the other variables in order to see the effect of the variables on quality of wine. Although there is no strong correlation between quality and other variables, the highest correlation was the alcohol level of the wine. Other than the quality, the most correlated variables were residual sugar and density, which I have investigated deeper in multivariate analysis section.
In multivariate analysis section, the relation between residual sugar, density and alcohol was very interesting. Those variables are related with each other (which was explained in detail). Residual sugar and density has affect on the alcohol directly and quality indirectly.
The toughest struggle I have faced was creating plots to reflect the relationsships between a ordered factor and continious variables. Moreover, the correlations were not clear between variables. However, in each section I have analyzed the variables their relations and come up with an idea about the effect of those variables on quality.